The state-of-the-art performance for object detection has been significantly improved over the past two years. Besides the introduction of powerful deep neural networks such as GoogLeNet and VGG, novel object detection frameworks such as R-CNN and its successors, Fast R-CNN and Faster R-CNN, play an essential role in improving the state of the art. Despite their effectiveness on still images, those frameworks are not specifically designed for object detection from videos, and the temporal and contextual information in videos is not fully investigated and utilized. In this work, we propose a deep learning framework that incorporates temporal and contextual information from tubelets obtained in videos, which dramatically improves the baseline performance of existing still-image detection frameworks when they are applied to videos. It is called T-CNN, i.e., tubelets with convolutional neural networks. The proposed framework won the recently introduced object-detection-from-video (VID) task with provided data in the ImageNet Large-Scale Visual Recognition Challenge 2015 (ILSVRC2015).